RRBS porcine project analysis pipeline

This report contains information regarding the analysis pipeline for Alexey Fedulov’s rrbs analysis’

Overview: 22 samples of Sus scrofa (pig) were received as DNA. Of the 22 samples, samples 9, 13, 17, and 20 were not sequenced, thereby leaving 18 samples. The samples were subjected to enzymatic digestion (MSP1 + TaqI), library preparation, bisulfite conversion, Bioanalyzer QC, KAPA library quantification, multiplex NGS on an Illumina HiSeq4000 and advanced data analysis.

The following steps were performed in the analysis pipeline:

Part 1: Initial QC

Initial QC analyses of raw reads was performed in FastQC (V0.11.5). To remove overrepresented sequences, adapters, and low-quality sequences, trimming was performed via a two-step process. First, trimming was performed using trim galore with the following parameters:

trim_galore --quality 20 --clip_R1 6 --adapter AGATCGGAAGAGC --output_dir ${dir} --rrbs $sample 

Then on the trimmed reads, trimming was once again performed in trimmomatic using the following parameters:

- trimmomatic:
  subcommand: SE
options:
  " ILLUMINACLIP:/gpfs/data/cbc/cbc_conda_v1/envs/cbc_conda/opt/trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:5:6:true":
  "SLIDINGWINDOW:4:20 MINLEN:35":

Part 2: Read quality plots

Library Raw reads Trimmed reads
Rh1_R1_001
Rh2_R1_001
Rh3_R1_001
Rh4_R1_001
Rh5_R1_001
Rh6_R1_001
Rh7_R1_001
Rh8_R1_001
Rh10_R1_001
Rh11_R1_001
Rh12_R1_001
Rh14_R1_001
Rh15_R1_001
Rh16_R1_001
Rh18_R1_001
Rh19_R1_001
Rh21_R1_001
Rh22_R1_001

Part 3: GC content plots

Library Raw reads Trimmed reads
Rh1_R1_001
Rh2_R1_001
Rh3_R1_001
Rh4_R1_001
Rh5_R1_001
Rh6_R1_001
Rh7_R1_001
Rh8_R1_001
Rh10_R1_001
Rh11_R1_001
Rh12_R1_001
Rh14_R1_001
Rh15_R1_001
Rh16_R1_001
Rh18_R1_001
Rh19_R1_001
Rh21_R1_001
Rh22_R1_001

Part 4: Read alignment

bismark -o /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/alignments/update --bowtie2 --genome /gpfs/data/shared/databases/refchef_refs/S_scrofa/primary/bismark_index/ --un --pbat $trimmed_read

Part 5: Alignment summary statistics

LIBRARY NUMBER_READS MAPPED_READS UNMAPPED_READS DUPLICATED_READS DUPLICATION_RATE
Rh1_R1_001 10,845,489 10,845,489 / 100% 0 / 0% 1,437,581 / 13.26% 12.6%
Rh2_R1_001 12,892,543 12,892,543 / 100% 0 / 0% 1,496,670 / 11.61% 10.93%
Rh3_R1_001 8,715,481 8,715,481 / 100% 0 / 0% 1,029,908 / 11.82% 11.02%
Rh4_R1_001 9,968,480 9,968,480 / 100% 0 / 0% 1,209,925 / 12.14% 11.51%
Rh5_R1_001 13,925,551 13,925,551 / 100% 0 / 0% 2,056,379 / 14.77% 13.91%
Rh6_R1_001 13,362,404 13,362,404 / 100% 0 / 0% 1,750,619 / 13.1% 12.31%
Rh7_R1_001 12,851,559 12,851,559 / 100% 0 / 0% 1,713,430 / 13.33% 12.48%
Rh8_R1_001 12,872,887 12,872,887 / 100% 0 / 0% 1,769,849 / 13.75% 12.93%
Rh10_R1_001 15,060,158 15,060,158 / 100% 0 / 0% 2,107,684 / 14% 13.11%
Rh11_R1_001 13,432,863 13,432,863 / 100% 0 / 0% 1,930,952 / 14.37% 13.51%
Rh12_R1_001 14,880,989 14,880,989 / 100% 0 / 0% 2,381,186 / 16% 14.75%
Rh14_R1_001 13,023,116 13,023,116 / 100% 0 / 0% 1,938,398 / 14.88% 13.62%
Rh15_R1_001 13,292,288 13,292,288 / 100% 0 / 0% 1,992,503 / 14.99% 13.72%
Rh16_R1_001 17,058,927 17,058,927 / 100% 0 / 0% 2,756,384 / 16.16% 15.03%
Rh18_R1_001 12,196,440 12,196,440 / 100% 0 / 0% 1,814,764 / 14.88% 13.42%
Rh19_R1_001 13,248,129 3,248,129 / 100% 0 / 0% 1,901,092 / 14.35% 13.16%
Rh21_R1_001 16,030,293 16,030,293 / 100% 0 / 0% 2,624,856 / 16.37% 14.84%
Rh22_R1_001 13,927,765 13,927,765 / 100% 0 / 0% 1,961,472 / 14.08% 13.05%
LIBRARY COVERAGE_MEAN COVERAGE_SD MEAN_MAP_QUALITY
Rh1_R1_001 0.1844 0.7664 28.42
Rh2_R1_001 0.2191 0.8243 29.32
Rh3_R1_001 0.1482 0.7405 27.77
Rh4_R1_001 0.1694 0.7166 28.62
Rh5_R1_001 0.2368 0.943 28.68
Rh6_R1_001 0.2271 0.9149 29.01
Rh7_R1_001 0.2185 0.8851 28.47
Rh8_R1_001 0.2188 0.9475 29.15
Rh10_R1_001 0.2561 1.0472 28.59
Rh11_R1_001 0.2284 0.9772 28.47
Rh12_R1_001 0.253 1.0263 29.06
Rh14_R1_001 0.2214 0.9804 29.35
Rh15_R1_001 0.226 0.9128 28.77
Rh16_R1_001 0.29 1.2004 29.18
Rh18_R1_001 0.2075 1.0798 28.59
Rh19_R1_001 0.2252 0.9003 29.01
Rh21_R1_001 0.2726 1.2379 29.02
Rh22_R1_001 0.2368 1.1062 29.13

Part 6: Methylation Extracor

Bismark methylation extractor was used to extract methylation information using the following parameters:

bismark_methylation_extractor --bedGraph --comprehensive -s --merge_non_CpG --report --output /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/alignments/update/ --gzip --multicore 8 --genome_folder /gpfs/data/shared/databases/refchef_refs/S_scrofa/primary/bismark_index/ $sample  

This produced m-bias plots, illustrating the methylation proportion across each possible position in the read (see Part 7 below).

Part 7: M-bias plots

Library M-bias plot
Rh1_R1_001
Rh2_R1_001
Rh3_R1_001
Rh4_R1_001
Rh5_R1_001
Rh6_R1_001
Rh7_R1_001
Rh8_R1_001
Rh10_R1_001
Rh11_R1_001
Rh12_R1_001
Rh14_R1_001
Rh15_R1_001
Rh16_R1_001
Rh18_R1_001
Rh19_R1_001
Rh21_R1_001
Rh22_R1_001

Part 8: Filtering and GLM (Analysis)

Genes were filtered such that only CpGs with at least 5 counts (methylated and unmethylated) in 3 samples were included for downstream analysis. Additionally, CpGs that were never methylated or always methylated were filtered out, as these provide no information about differential methylation. The resulting sample size after filtering was 84,999.

The glmFit function was used to fit a negative binomial generalized log-linear model. The experimental design matrix was constructed using modelMatrixMeth with a factorial experimental design (~0 + group), where group was a factor variable with levels comprised of each combination of treatment/diet/tissue. We dropped the intercept from our model to parameterize it as a means model. Subsequently, the glmLRT function was used to find differentially methylated loci for comparisons of interest, which were made by constructing contrast vectors. Individual CpG sites were considered differentially methylated if the nominal p-value was < 0.02.